Scaling Law

What is a scaling law?

The Chinchilla scaling law is a rule for choosing the optimal model size (number of parameters) and dataset size (number of training tokens) for a given compute budget.

For compute-optimal training, the number of training tokens should be approximately 20 times the number of model parameters.
The model size and the number of training tokens should be scaled in equal proportion: for every doubling of the model size, the number of training tokens should also be doubled.
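
The two rules above can be turned into a rough budget calculator. The sketch below is illustrative only: it assumes the commonly used approximation that training compute is C ≈ 6 × N × D FLOPs (N parameters, D tokens), which is not stated in these notes, and it treats the 20-tokens-per-parameter ratio as exact. The function name and constants are hypothetical.

```python
import math

# Rule of thumb from the notes: about 20 training tokens per parameter.
TOKENS_PER_PARAM = 20.0
# Assumption (not from the notes): training compute C ~= 6 * N * D FLOPs.
FLOPS_PER_PARAM_TOKEN = 6.0


def compute_optimal_allocation(compute_flops: float) -> tuple[float, float]:
    """Return (n_params, n_tokens) for a given training compute budget in FLOPs.

    Solves C = 6 * N * D together with D = 20 * N, which gives
    N = sqrt(C / 120) and D = 20 * N.
    """
    if compute_flops <= 0:
        raise ValueError("compute_flops must be positive")
    n_params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    n_params, n_tokens = compute_optimal_allocation(5.88e23)
    print(f"params: {n_params:.2e}, tokens: {n_tokens:.2e}")
```

Under these assumptions, a budget of 5.88e23 FLOPs yields roughly 7e10 parameters and 1.4e12 tokens. Because both quantities grow with the square root of compute, quadrupling the budget doubles the model size and the token count together, which matches the equal-scaling rule.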